Towards SMS Spam Filtering: Results under a New Dataset
Abstract
The growth in the number of mobile phone users has led to a dramatic increase in SMS spam messages. Recent reports clearly indicate that the volume of mobile phone spam is rising sharply year by year. In practice, fighting this plague is made difficult by several factors, including the lower rate of SMS that has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. Probably one of the major concerns in academic settings is the scarcity of public SMS spam datasets, which are sorely needed for the validation and comparison of different classifiers. Moreover, traditional content-based filters may see their performance seriously degraded, since SMS messages are fairly short and their text is generally rife with idioms and abbreviations. In this paper, we present details about a new real, public and non-encoded SMS spam collection that is, as far as we know, the largest one available. We also offer a comprehensive analysis of this dataset to ensure that it contains no duplicated messages coming from previously existing datasets, since such duplicates may ease the task of learning SMS spam classifiers and could compromise the evaluation of methods. Additionally, we compare the performance achieved by several established machine learning techniques. In summary, the results indicate that the procedure followed to build the collection does not lead to near-duplicates and that, among the classifiers, the Support Vector Machine outperforms the other evaluated techniques and can therefore be used as a good baseline for further comparison.
Keywords: Mobile phone spam; SMS spam; spam filtering; text categorization; classification.
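The near-duplicate check mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the authors performed their analysis, so everything here is an assumption made for illustration: the word 3-gram shingling, the Jaccard similarity measure, the 0.8 threshold, and the function names `shingles`, `jaccard` and `near_duplicates` are all hypothetical choices, not the paper's procedure.

```python
import re
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams of a lowercased message (assumed tokenization)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if len(tokens) < n:
        # Very short messages: treat the whole token sequence as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets; two empty sets score 0.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(messages, threshold=0.8, n=3):
    """Return (i, j, score) for every message pair whose similarity reaches the threshold."""
    sets = [shingles(m, n) for m in messages]
    pairs = []
    for i, j in combinations(range(len(messages)), 2):
        score = jaccard(sets[i], sets[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


if __name__ == "__main__":
    # Invented sample messages, not taken from the collection.
    sample = [
        "Free entry in a weekly competition to win tickets now",
        "FREE entry in a weekly competition to win tickets now!",
        "Are we still meeting for lunch today",
    ]
    print(near_duplicates(sample))
```

The pairwise comparison is quadratic in the number of messages, which is acceptable for a collection of a few thousand SMS messages; a larger corpus would likely call for hashing-based candidate selection instead.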
Similar resources
An Effective Model for SMS Spam Detection Using Content-based Features and Averaged Neural Network
In recent years, there has been considerable interest in using the short message service (SMS) as one of the essential and straightforward communication services on mobile devices. The increased popularity of this service has also increased the number of attacks on mobile devices, such as SMS spam messages. SMS spam messages constitute a real problem for mobile subscribers; this worries telecomm...
SMS Spam Filtering Technique Based on Artificial Immune System
The Short Message Service (SMS) has an important economic impact for end users and service providers. Spam is a serious universal problem that affects almost all users. Several studies have been presented, including implementations of spam filters that prevent spam from reaching its destination. The Naïve Bayesian algorithm is one of the most effective approaches used in filtering te...
SMS spam filtering: Methods and data
Mobile or SMS spam is a real and growing problem, primarily due to the availability of very cheap bulk pre-pay SMS packages and the fact that SMS engenders higher response rates, as it is a trusted and personal service. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However, it poses its own specific challenges. This paper motivates ...
SMS Spam Detection using Machine Learning Approach
Over recent years, as the popularity of mobile phone devices has increased, Short Message Service (SMS) has grown into a multi-billion-dollar industry. At the same time, the reduction in the cost of messaging services has resulted in growth in unsolicited commercial advertisements (spam) being sent to mobile phones. In parts of Asia, up to 30% of text messages were spam in 2012. Lack of real data...
A Bi-Level Text Classification Approach for SMS Spam Filtering and Identifying Priority Messages
Short Message Service (SMS) traffic is increasing day by day, and trillions of SMS messages are sent and received by billions of users every day. Spam messages are increasing in the same proportion. A number of recent advances have been made in the field of SMS spam detection and filtering. The objective of this work is twofold: the first is to identify and classify spam messages from the collection...